This article describes the framework and the steps I followed to develop a machine learning model with ensemble learning for predicting hospital readmissions. In Part 1, I detailed the process of cleaning and preparing the dataset, the Diabetes 130-US hospitals for years 1999-2008 Data Set, downloaded from the UCI Machine Learning Repository.
Notice that the data set employed for developing the ensemble described in this article was slightly different from the finalized data set in Part 1. Nevertheless, the preparation process was nearly identical, except that the feature set was based on results from a different Boruta (CRAN) run.
The original data set downloaded and visualized in Part 1 had 101766 observations of 49 variables. After preparation, the resulting data set imported here had 70245 observations of 27 variables. Here’s the structure.
# index column removed
str(( rmd.imported.data.set <- read.csv( rmd.info[['imported.file']] )[-1] ))
## 'data.frame': 70245 obs. of 27 variables:
## $ race : Factor w/ 5 levels "AfricanAmerican",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ gender : Factor w/ 2 levels "Femals","Male": 1 2 2 1 1 1 1 1 1 1 ...
## $ age : num 1.5 1.5 1 2 1.5 1.5 1.5 1.5 1 2 ...
## $ admission_type_id : Factor w/ 2 levels "k","u": 1 1 2 2 2 2 2 1 2 2 ...
## $ discharge_disposition_id: Factor w/ 4 levels "d","h","o","u": 1 1 4 1 4 4 4 1 1 4 ...
## $ admission_source_id : Factor w/ 5 levels "b","o","r","t",..: 3 2 3 3 2 2 3 4 3 2 ...
## $ time_in_hospital : num 2.92 2.92 2.92 2.92 2.92 ...
## $ num_lab_procedures : num -0.80047 1.70905 -0.148 0.00257 1.35772 ...
## $ num_procedures : num 0.857 -0.271 -0.271 0.293 0.857 ...
## $ num_medications : num 1.359 0.31 0.194 -0.506 1.942 ...
## $ number_outpatient : num -0.275 -0.275 -0.275 -0.275 -0.275 ...
## $ number_emergency : num -0.194 -0.194 -0.194 -0.194 -0.194 ...
## $ number_inpatient : num -0.394 -0.394 -0.394 -0.394 -0.394 ...
## $ diag_1 : Factor w/ 9 levels "Circulatory",..: 7 1 6 2 3 9 1 1 2 1 ...
## $ diag_2 : Factor w/ 9 levels "Circulatory",..: 7 2 5 2 9 1 5 1 2 1 ...
## $ diag_3 : Factor w/ 9 levels "Circulatory",..: 3 2 2 1 1 2 1 1 3 6 ...
## $ number_diagnoses : num 0.341 0.341 -1.679 -2.185 0.846 ...
## $ A1Cresult : Factor w/ 4 levels ">7",">8","None",..: 3 4 3 4 1 3 3 3 1 3 ...
## $ metformin : Factor w/ 4 levels "Down","No","Steady",..: 3 2 3 3 2 2 2 3 2 2 ...
## $ glipizide : Factor w/ 4 levels "Down","No","Steady",..: 3 2 3 4 2 2 3 3 2 2 ...
## $ glyburide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 3 2 2 2 2 2 2 2 ...
## $ pioglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ rosiglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ insulin : Factor w/ 4 levels "Down","No","Steady",..: 1 4 2 2 4 1 3 3 2 3 ...
## $ change : Factor w/ 2 levels "Ch","No": 1 1 1 1 1 1 1 1 2 2 ...
## $ diabetesMed : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 1 2 ...
## $ readmitted : int 0 1 0 0 0 0 1 0 0 0 ...
In this project, I used a subset of the imported data set to develop the ensemble. With sufficient computing resources, I would have used 100% of the imported data instead; on my i7 laptop, training on even a few thousand observations took quite a few hours. A smaller data set kept the development cycle manageable.
## 5 % of the imported data = 0.05 x 70245 obs. = 3512 obs.
I separated the data into three partitions as follows:
## Training   ( 0.6 ) = sampling 60 % of 3512 obs. = 2141 obs.
## Testing    ( 0.2 ) = sampling 20 % of 3512 obs. = 716 obs.
## Validation ( 0.2 ) = sampling 20 % of 3512 obs. = 655 obs.
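The subsetting and partitioning above can be sketched roughly as follows. This is a hedged illustration, not the article's actual code; the variable names (rmd.subset, rmd.train, rmd.test, rmd.hold) and the use of a single random index are assumptions.

```r
set.seed(135)

# Take a 5% subset of the imported data to keep training times manageable
rmd.subset <- rmd.imported.data.set[
  sample(nrow(rmd.imported.data.set), 0.05 * nrow(rmd.imported.data.set)), ]

# Partition into training (60%), testing (20%), and validation (20%)
idx <- sample(c('train', 'test', 'hold'), nrow(rmd.subset),
              replace = TRUE, prob = c(0.6, 0.2, 0.2))
rmd.train <- rmd.subset[idx == 'train', ]
rmd.test  <- rmd.subset[idx == 'test',  ]
rmd.hold  <- rmd.subset[idx == 'hold',  ]
```

Because the assignment is random, the partition sizes will vary slightly around the 2141/716/655 split reported above.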
While examining the training data, I noticed that the label, readmitted, had a highly disproportionate distribution of values.
table(rmd.train.org.readmitted)
## rmd.train.org.readmitted
## 0 1
## 1863 278
This was problematic: class imbalance tends to overwhelm a model and lead to incorrect classification. During training, the model would learn a great deal about the zeros and become prone to classifying the label as ‘not readmitted’, while learning little about the ones and how to predict ‘readmitted’ scenarios. Consequently, the trained model would potentially predict with a high misclassification rate.
To circumvent the imbalance issue, I used SMOTE to generate a “SMOTE’d” set of values based on the label values. The following illustrates using SMOTE from the package Data Mining with R (DMwR) to generate a data set for training, testing and cross-validation.
Notice the balance between oversampling and undersampling are configurable with perc.over and perc.under as detailed in the documentation.
table(rmd.train.balanced.readmitted)
## rmd.train.balanced.readmitted
## 0 1
## 778 834
Now the training data was ready. I started building an ensemble.
For a complex problem like hospital readmissions, managing the bias-variance tradeoff is a challenge. Using ensemble learning to complement some algorithms’ weaknesses with others’ strengths, by evaluating, weighting, combining, and optimizing their results, seemed a sound strategy and logical approach. The following illustrates the concept of ensemble learning; additional information is available at the source.
To construct the ensemble model, I used SuperLearner, which provides a user-friendly framework and essentially takes the mechanics of orchestrating cross-validation for each algorithm off the ensemble developer’s hands. Here’s a tutorial showing how SuperLearner works. Notice that SuperLearner is a wrapper: the algorithms included in an ensemble must be installed first and loaded as libraries before they can be invoked.
Hospital readmission is a classification problem, since a patient is either readmitted or not. A number of algorithms are known for solving classification problems, including Random Forest, ranger, Gradient Boosting, xgboost, Elastic Net, Support Vector Machines, Neural Networks (nnet), etc. One important task at this stage was to investigate and select a set of algorithms, or learners, complementary to one another, so that the ensemble could make predictions with high accuracy. Incidentally, in the context of ensemble learning an employed algorithm is also called a ‘learner’, and the two terms are used interchangeably for the remainder of this article.
Using SuperLearner, I first gathered candidate learners to form an ensemble, generated preliminary results, assessed the baseline performance, and resolved operability and compatibility issues, if any. At the same time, I ran the ensemble with tuning parameters, or test grids, to find optimal settings for individual learners. After a series of tests, I eventually settled on an ensemble of two learners, ranger and xgboost, with the following set of parameters.
To create a test grid, SuperLearner provides the API create.Learner. As demonstrated below, for each algorithm to be called by SuperLearner, I defined a set of candidate values. create.Learner associates those settings with the algorithm and generates one function per parameter combination, each named after its referenced parameters and values.
#--------------
# HYPERPARAMETERS
#--------------
rmd.xgboost.custom <- create.Learner('SL.xgboost'
,tune=list( # default: ntree=500, depth=4, shrinkage=0.1, minobs=10
ntrees=c(500,1000) ,max_depth=4
,shrinkage=c(0.001,0.01,0.1) ,minobspernode=c(10)
)
,detailed_names = TRUE ,name_prefix = 'xgboost'
)
rmd.ranger.custom <- create.Learner('SL.ranger'
,tune = list( # default: num.trees=1000, mtry=sqrt(variables) or var/3
num.trees = c(1000,2000)
,mtry = floor(sqrt(ncol(x.train))*c(1,2))
)
,detailed_names = TRUE ,name_prefix = 'ranger'
)
And create.Learner generated functions accordingly, based on combinations of the above settings, as indicated by each function name:
# Learners for ensemble learning
rmd.ranger.custom$names
## [1] "ranger_1000_5" "ranger_2000_5" "ranger_1000_10" "ranger_2000_10"
rmd.xgboost.custom$names
## [1] "xgboost_500_4_0.001_10" "xgboost_1000_4_0.001_10"
## [3] "xgboost_500_4_0.01_10" "xgboost_1000_4_0.01_10"
## [5] "xgboost_500_4_0.1_10" "xgboost_1000_4_0.1_10"
To form the ensemble, I included the function names generated by create.Learner as the library reference for SuperLearner. Each function was a learner, i.e. a designated algorithm with configured settings, and a member of the ensemble.
# Ensemble
rmd.SL.algorithm <- c( rmd.ranger.custom$names, rmd.xgboost.custom$names )
There are quite a few packages for enabling parallel processing in R; most are wrappers and ultimately fall into either the ‘multicore’ or the ‘snow’ system. The package parallel was essential and is what I used. My computing environment was Windows, so SNOW clusters were applicable. To minimize complexity, I ran parallel processing on a single host using my laptop’s multiple CPUs, rather than across multiple hosts.
The following is a sample configuration for training an ensemble in parallel. For demonstration, it runs all three SuperLearner evaluation methods, namely NNLS, AUC and NNloglik.
#-------------------------------
# MULTICORE/PARALLEL PROCESSING
#-------------------------------
if (!require('parallel')) install.packages('parallel'); library(parallel)
cl <- makeCluster(detectCores()-1)
clusterExport(cl, c( listWrappers()
,'SuperLearner','CV.SuperLearner','predict.SuperLearner','cvControl'
,'x.train','y.train','x.test','y.test','x.hold','y.hold'
,'family','nnls','auc','nnloglik' ,'save.dir'
,'rmd.SL.algorithm' ,rmd.ranger.custom$names ,rmd.xgboost.custom$names
))
clusterSetRNGStream(cl, iseed=135)
# Load libraries on workers
clusterEvalQ(cl, {
library(SuperLearner);library(caret)
library(ranger);library(xgboost)
})
clusterEvalQ(cl, {
ensem.auc <- SuperLearner(Y=y.train ,X=x.train #,verbose=TRUE
,family=family,method=auc,SL.library=rmd.SL.algorithm
,cvControl=list(V=cvControl)
)
ensem.nnloglik <- SuperLearner(Y=y.train ,X=x.train #,verbose=TRUE
,family=family,method=nnloglik,SL.library=rmd.SL.algorithm
,cvControl=list(V=cvControl)
)
ensem.nnls <- SuperLearner(Y=y.train ,X=x.train #,verbose=TRUE
,family=family,method=nnls,SL.library=rmd.SL.algorithm
,cvControl=list(V=cvControl)
)
})
system.time({
ensem.cv.auc <- CV.SuperLearner( Y=y.hold ,X=x.hold #,verbose=TRUE
,cvControl=list(V=cvControl),innerCvControl=list(list(V=cvControl-1))
,family=family ,method=auc ,SL.library=rmd.SL.algorithm ,parallel=cl
)
})
system.time({
ensem.cv.nnls <- CV.SuperLearner(Y=y.hold ,X=x.hold #,verbose=TRUE
,cvControl=list(V=cvControl),innerCvControl=list(list(V=cvControl-1))
,family=family ,method=nnls ,SL.library=rmd.SL.algorithm ,parallel=cl
)
})
system.time({
ensem.cv.nnloglik <- CV.SuperLearner( Y=y.hold ,X=x.hold #,verbose=TRUE
,cvControl=list(V=cvControl),innerCvControl=list(list(V=cvControl-1))
,family=family ,method=nnloglik ,SL.library=rmd.SL.algorithm ,parallel=cl
)
})
stopCluster(cl)
SuperLearner evaluated the ensemble based on “risk” instead of “loss.” Both loss and risk measure how well a model fits data. To oversimplify the technical details: “loss” measures a model’s performance against training data, while “risk” is the average, or expected, loss across unseen data, i.e. test and validation data.
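As a rough illustration of that distinction, assuming a squared-error loss, risk can be thought of as the loss averaged over held-out predictions:

```r
# squared-error loss of individual predictions (an assumed loss function)
loss <- function(pred, y) (pred - y)^2

# cross-validated risk: the average loss over held-out predictions,
# i.e. predictions made for observations while they were left out of training
cv.risk <- function(cv.pred, y) mean(loss(cv.pred, y))

cv.risk(c(0, 1), c(0, 1))  # perfect predictions -> risk of 0
cv.risk(c(1, 1), c(0, 1))  # one wrong prediction -> risk of 0.5
```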
SuperLearner provides three methods for evaluating the risk associated with the predictions made by each learner in an ensemble model; it then derives and assigns a coefficient, or weight, to each algorithm accordingly. The methods are NNLS, AUC, and NNloglik, as described on page 6 of the documentation.
Based on the specified evaluation method, here nnls, SuperLearner ranked the importance of each learner by deriving and assigning a coefficient, or weight, as shown in the following output. The smaller the coefficient, the less the algorithm would be expected to contribute, on average, to a prediction made by the ensemble. A coefficient of zero indicates that SuperLearner removed the associated algorithm from the ensemble.
learner.risk
## nnls.cvRisk nnls.coef auc.cvRisk
## ranger_1000_5_All 0.1366549 0.3244972 0.1006000
## ranger_2000_5_All 0.1368446 0.0000000 0.1005699
## ranger_1000_10_All 0.1337540 0.3691810 0.1028917
## ranger_2000_10_All 0.1341303 0.0000000 0.1029857
## xgboost_500_4_0.001_10_All 0.2064826 0.0000000 0.1951100
## xgboost_1000_4_0.001_10_All 0.1856849 0.0000000 0.1766566
## xgboost_500_4_0.01_10_All 0.1447884 0.0000000 0.1409505
## xgboost_1000_4_0.01_10_All 0.1407350 0.0000000 0.1363177
## xgboost_500_4_0.1_10_All 0.1418242 0.0000000 0.1316602
## xgboost_1000_4_0.1_10_All 0.1453239 0.3063218 0.1303194
learner.coef
## auc.coef nnloglik.cvRisk nnloglik.coef
## ranger_1000_5_All 0.252341546 0.4339163 0.0000000
## ranger_2000_5_All 0.234113550 0.4329543 0.7249833
## ranger_1000_10_All 0.190861287 0.4174555 0.1446958
## ranger_2000_10_All 0.175185781 0.4175521 0.0000000
## xgboost_500_4_0.001_10_All 0.068448460 0.6065607 0.0000000
## xgboost_1000_4_0.001_10_All 0.002082963 0.5578090 0.0000000
## xgboost_500_4_0.01_10_All 0.003134338 0.4502041 0.0000000
## xgboost_1000_4_0.01_10_All 0.000000000 0.4336628 0.0000000
## xgboost_500_4_0.1_10_All 0.004156602 0.4356830 0.0000000
## xgboost_1000_4_0.1_10_All 0.069675473 0.4577616 0.1303209
With test data, the ensemble made predictions with the following summary:
## nnls auc nnloglik
## Min. :0.04924 Min. :0.09355 Min. :0.04489
## 1st Qu.:0.22404 1st Qu.:0.26939 1st Qu.:0.24257
## Median :0.32702 Median :0.35986 Median :0.34273
## Mean :0.34299 Mean :0.36699 Mean :0.35274
## 3rd Qu.:0.44372 3rd Qu.:0.45048 3rd Qu.:0.45282
## Max. :0.92495 Max. :0.87470 Max. :0.90667
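The summaries above and the confusion matrices that follow can be produced along these lines. This is a sketch continuing the article's objects (ensem.nnls, x.test, y.test); the 0.5 probability cutoff is an assumption, not a value stated in the article:

```r
library(caret)

# predicted probabilities from the trained NNLS ensemble
pred.nnls <- predict(ensem.nnls, newdata = x.test)$pred
summary(pred.nnls)

# threshold probabilities into class labels; 0.5 is an assumed cutoff
pred.class <- ifelse(pred.nnls >= 0.5, 1, 0)

# confusion matrix with '0' (not readmitted) as the positive class,
# matching the $positive entry in the output below
confusion.matrix.nnls <- confusionMatrix(factor(pred.class, levels = c(0, 1)),
                                         factor(y.test,     levels = c(0, 1)),
                                         positive = '0')
```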
The confusion matrices below reveal the distribution of the various error types for each method:
confusion.matrix.nnls
## $positive
## [1] "0"
##
## $table
## Reference
## Prediction 0 1
## 0 511 79
## 1 106 20
##
## $overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.74162011 0.02711675 0.70790381 0.77333130 0.86173184
## AccuracyPValue McnemarPValue
## 1.00000000 0.05593291
##
## $byClass
## Sensitivity Specificity Pos Pred Value
## 0.8282010 0.2020202 0.8661017
## Neg Pred Value Precision Recall
## 0.1587302 0.8661017 0.8282010
## F1 Prevalence Detection Rate
## 0.8467274 0.8617318 0.7136872
## Detection Prevalence Balanced Accuracy
## 0.8240223 0.5151106
##
## $mode
## [1] "sens_spec"
##
## $dots
## list()
##
## attr(,"class")
## [1] "confusionMatrix"
confusion.matrix.auc
## $positive
## [1] "0"
##
## $table
## Reference
## Prediction 0 1
## 0 525 78
## 1 92 21
##
## $overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.76256983 0.05948168 0.72965859 0.79330040 0.86173184
## AccuracyPValue McnemarPValue
## 1.00000000 0.31873806
##
## $byClass
## Sensitivity Specificity Pos Pred Value
## 0.8508914 0.2121212 0.8706468
## Neg Pred Value Precision Recall
## 0.1858407 0.8706468 0.8508914
## F1 Prevalence Detection Rate
## 0.8606557 0.8617318 0.7332402
## Detection Prevalence Balanced Accuracy
## 0.8421788 0.5315063
##
## $mode
## [1] "sens_spec"
##
## $dots
## list()
##
## attr(,"class")
## [1] "confusionMatrix"
confusion.matrix.nnloglik
## $positive
## [1] "0"
##
## $table
## Reference
## Prediction 0 1
## 0 523 77
## 1 94 22
##
## $overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.76117318 0.06517424 0.72820515 0.79197227 0.86173184
## AccuracyPValue McnemarPValue
## 1.00000000 0.22112181
##
## $byClass
## Sensitivity Specificity Pos Pred Value
## 0.8476499 0.2222222 0.8716667
## Neg Pred Value Precision Recall
## 0.1896552 0.8716667 0.8476499
## F1 Prevalence Detection Rate
## 0.8594906 0.8617318 0.7304469
## Detection Prevalence Balanced Accuracy
## 0.8379888 0.5349361
##
## $mode
## [1] "sens_spec"
##
## $dots
## list()
##
## attr(,"class")
## [1] "confusionMatrix"
For a binary classification problem with class imbalance, a predictive model may suffer from the accuracy paradox, where accuracy may not be the best indicator of how well the model performs. Consider the following statistics of the ensemble:
##                MSE  Accuracy Sensitivity Specificity AllNoRecurr AllRecurr
## nnls 0.1784130 0.7416201 0.8282010 0.2020202 0.8617318 0.1382682
## auc 0.1819083 0.7625698 0.8508914 0.2121212 0.8617318 0.1382682
## nnloglik 0.1798675 0.7611732 0.8476499 0.2222222 0.8617318 0.1382682
Since the data was imbalanced towards ‘not readmitted’, All No Occurrence, i.e. classifying everything as ‘not readmitted’, showed better accuracy than the Accuracy values reported above, and the ensemble seemed to contribute little. On the surface, indeed.
However, for a hospital readmissions problem, what matters more is perhaps the number of false negatives: patients predicted as ‘not readmitted’ who were actually observed as ‘readmitted’ in the test data. This misclassification essentially misrepresents a patient’s readiness to be released from the hospital, and prematurely releasing a patient increases the uncertainty of the patient’s health relative to the treatment received.
In this case,
All No Occurrence, i.e. predicting that everyone was ‘not readmitted’, would have delivered 86% accuracy while misclassifying the 14% who were actually readmitted,
while the ensemble had only 74% to 76% accuracy, yet above 20% specificity, i.e. the percentage of ‘readmitted’ cases correctly classified.
In other words, the ensemble offered better predictions for readmissions than All No Occurrence did.